VRVis – ComVis: VAST 2010 Challenge Mini Challenge 1: Text Records

VRVis - ComVis

VAST 2010 Challenge
Text Records - Investigations into Arms Dealing

Authors and Affiliations:

Andreas Ammer, VRVis, ammer@vrvis.at
Denis Gracanin, Virginia Tech, gracanin@vt.edu
Zoltan Konyha, VRVis, konyha@vrvis.at
Kresimir Matkovic, VRVis, matkovic@vrvis.at
Cagatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no [PRIMARY contact]

Tool(s):

We used our interactive, multiple linked views visualization application ComVis in our analysis. ComVis can visualize scalar and categorical data in several different views. Each view is interactive and brushable. Brushes defined in the same view or in different views can be combined using Boolean operators (composite brushing).

We used Python scripts to:

- Split each document in the dataset into sub-reports in order to have more accurate relation mining

- Categorize words in our word-document matrix

- Automatically create a node-edge structure from the selected words and sub-reports to create files suitable for our graph drawing tool

We used the R statistical computing system (http://www.r-project.org/) for text mining. In R, we used the text mining package tm (http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf). This package was used to:

- Remove stop-words (e.g. a, the, and)

- Remove punctuation

- Remove suffixes (stemming)

- Provide a term-document matrix (200 terms * 5 documents) which contains the most frequent terms in documents along with their frequencies
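The preprocessing steps above can be sketched in Python for readers who want to reproduce them without R. This is only an illustration, not the tm pipeline used in the analysis: the stop-word list, the `crude_stem` suffix stripper and the function names are assumptions made for this example, and a real stemmer (such as the Porter stemmer used by tm) behaves differently.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists (such as tm's) are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, delete punctuation, drop stop words, then stem."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [crude_stem(w) for w in cleaned.split() if w not in STOP_WORDS]

def term_document_matrix(documents, top_n=200):
    """Map each of the top_n most frequent terms to its count per document."""
    counts = [Counter(tokenize(text)) for text in documents]
    totals = Counter()
    for doc_counts in counts:
        totals.update(doc_counts)
    top_terms = [term for term, _ in totals.most_common(top_n)]
    return {term: [doc_counts[term] for doc_counts in counts] for term in top_terms}

# Two made-up sentences standing in for documents.
docs = ["The dealers are shipping parts.", "Parts shipped to the dealer."]
matrix = term_document_matrix(docs)
```

Deleting punctuation before splitting on whitespace is what turns an e-mail address into a single token such as "boyohotmailcom", which matches the form of the terms reported later in this document.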

We used the Tulip graph visualization software (http://tulip.labri.fr) to visualize the graph structure that we built automatically with Python scripts. Tulip lets us view the graph with different layouts and use a variety of visual properties to better analyze the relational data. In addition, the tool provides different forms of interaction with the graph.

Video:

Download video (6.2 MB)

ANSWERS:

MC1.1: Summarize the activities that happened in each country with respect to illegal arms deals based on a synthesis of the information from the different report types and sources. State the situation in each country at the end of the period (i.e. the end of the information you have been given) with respect to illegal arms deals being pursued. Present a hypothesis about the next activities you expect to take place, with respect to the people, groups, and countries.

The text mining package provides a term-document matrix that contains the frequencies of the most frequent terms per document, plus each term’s total frequency. The preprocessing steps (listed in the Tools section) let us filter out uninformative terms (e.g. stop words) and obtain more reliable frequency counts (by stemming and removing punctuation). Each term is classified into one of five categories: 1) General Term, 2) City, 3) Country, 4) People & Organization, 5) Month. First, every word is classified as a "General Term". Next, the word list is checked against lists of country names (~246 countries), city names (~37000 cities) and month names (12 months), and matching words are reclassified accordingly. We marked the "People & Organization" names manually because there were only a limited number of such words.
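The classification step can be illustrated with a short Python sketch. The lookup sets below are tiny placeholders for the real country and city lists, and the precedence order (manual People & Organization marks first, then country, city, month) is an assumption; the text does not say how a word matching several lists was resolved.

```python
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}

def classify_terms(terms, countries, cities, people_orgs, months=MONTHS):
    """Label every term, defaulting to "General Term" when no list matches."""
    labels = {}
    for term in terms:
        key = term.lower()
        if key in people_orgs:       # manually marked names (assumed to win)
            labels[term] = "People & Organization"
        elif key in countries:
            labels[term] = "Country"
        elif key in cities:
            labels[term] = "City"
        elif key in months:
            labels[term] = "Month"
        else:
            labels[term] = "General Term"
    return labels

# Placeholder lists; the real ones held ~246 countries and ~37000 cities.
labels = classify_terms(
    ["parts", "karachi", "yemen", "bukhari", "april"],
    countries={"yemen", "kenya"},
    cities={"karachi", "dubai"},
    people_orgs={"bukhari"},
)
```

Because the terms have already been stemmed and stripped of punctuation, a practical version would need to apply the same preprocessing to the lookup lists, otherwise some names would never match.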

After the classification phase, the frequency lists are exported to a format that ComVis can read. In ComVis, we employed:

i) "Term frequency vs term" scatter plots involving the whole dataset.

ii) A histogram to indicate classification distribution of terms and to brush certain classes of terms.

iii) A parallel coordinate plot where each axis represents a document and each line is a term, drawn through its frequency in each document. This visualization lets us see how terms are distributed over the different sources of information and limit our analyses to certain types of documents.

iv) A tag cloud to highlight the terms that are brushed

Using basic linking & brushing, we can select subsets of words that come from certain frequency ranges and categories. Being able to see the frequencies over different document types lets the analyst see how often certain actors or places are mentioned in different sources of information. The most frequent general terms gave us few clues about the story. However, when we brushed a certain frequency range, we ended up with more interesting terms such as “parts, Carabobo, vwhombre, vwparts4salecheep, burj”. A comparative screenshot can be seen in Figure 1.1.

Figure 1.1 – The first column visualizes term classifications, the second column term frequencies, and the third column the brushed terms. In both rows, “general term” subsets from different frequency ranges are brushed. In the top row we see that the most frequent terms are not very interesting. However, the brushed subset in the bottom row contains more interesting terms.
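Outside ComVis, brushing one category over a frequency range amounts to a simple filter. The sketch below shows the idea. The `brush` function, the row layout and all frequencies are invented for illustration and are not taken from the dataset.

```python
def brush(rows, category=None, min_freq=0, max_freq=float("inf")):
    """Return the terms whose category and total frequency fall inside the brush."""
    return [
        row["term"]
        for row in rows
        if (category is None or row["category"] == category)
        and min_freq <= row["freq"] <= max_freq
    ]

# Made-up rows for illustration only; the real frequencies are not given here.
rows = [
    {"term": "said", "category": "General Term", "freq": 120},
    {"term": "parts", "category": "General Term", "freq": 14},
    {"term": "burj", "category": "General Term", "freq": 6},
    {"term": "dubai", "category": "City", "freq": 15},
]

high = brush(rows, category="General Term", min_freq=50)
mid = brush(rows, category="General Term", min_freq=5, max_freq=20)
```

Here `high` plays the role of the top row of the figure and `mid` the bottom row, where the less frequent but more telling terms appear.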

To start our country-wise analysis, we first brush country and place names and steer our analysis with respect to the resulting place list (Figure 1.2). The parallel coordinates plot shows that while telephone and newspaper records contain a large variety of place names with low frequency, intelligence reports contain fewer place names with higher frequency.

Figure 1.2 – City and country names brushed. In addition to the first three views, the parallel coordinate plot shows the distribution of terms over documents.

Using the prominent words in documents, we ended up with the following country-wise analysis results:

Most of the information regarding Pakistan and Gaza is derived from intelligence reports and telephone intercepts. Figure 1.3 displays most of the prominent actors and places in these documents.

Pakistan – In Karachi, the organization named Lashkar-e-Jhangvi is quite active. Bukhari and Bhutani are key people in this organization. They meet at apartments and mosques. They possibly ordered guns (which possibly arrived in black boxes), and there were many money transfers from Bukhari's account.

Gaza - Kasem, Khouri and Anka are the key people in Gaza. They are preparing for a mission and frequently talk on the phone.

Figure 1.3 – Result of a composite brushing sequence: 1) brush intelligence reports, 2) add telephone reports with an OR brush, 3) use a diff brush over the histogram to select only place and people names.
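The composite brushing sequence in the caption can be read as set algebra over the term list. The Python sketch below follows one reading of the steps, in which the diff brush subtracts the "General Term" and "Month" categories. The helper names and the sample rows are hypothetical, and this is not how ComVis itself is implemented.

```python
# Hypothetical rows: term -> (category, {document type: frequency}).
terms = {
    "bukhari": ("People & Organization", {"intelligence": 3, "telephone": 0}),
    "karachi": ("City", {"intelligence": 2, "telephone": 1}),
    "kasem": ("People & Organization", {"intelligence": 0, "telephone": 4}),
    "meeting": ("General Term", {"intelligence": 5, "telephone": 2}),
    "moscow": ("City", {"newspaper": 2}),
}

def brush_document(terms, doc):
    """Terms that occur at least once in the given document type."""
    return {t for t, (_, freqs) in terms.items() if freqs.get(doc, 0) > 0}

def brush_categories(terms, categories):
    """Terms whose category is one of the given categories."""
    return {t for t, (cat, _) in terms.items() if cat in categories}

selected = brush_document(terms, "intelligence")                 # 1) first brush
selected |= brush_document(terms, "telephone")                   # 2) OR brush
selected -= brush_categories(terms, {"General Term", "Month"})   # 3) diff brush
```

With these sample rows, `selected` ends up holding only the people and place names that occur in intelligence or telephone reports.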

Russia - A number of gun dealers are active in Moscow: Nicolai Kuryakin, Boonmee Khemkhaengare, Dombrowski. They do business with contacts in Nigeria and Yemen and are planning a meeting in Dubai with people from Pakistan, Gaza, Turkey, Kenya and Venezuela. They made a deal with the Yemeni Salah Ahmed and received diamonds for gun shipments.

Turkey & Syria - There is a joint effort from people in Turkey and Syria to organize activities. They are looking for supplies and will join the meeting in Dubai with Russian gun dealers.

Nigeria - Parties in Nigeria are trying to make deals with Russian gun dealers. This relation and the important actors were discovered by analyzing the email reports with a special interest in terms like “Mikhail, boyohotmailcom, joetomskauru”, which we found through the analysis in Figure 1.4.

Kenya – Thabiti Otieno and Nahid Owiti are suppliers in the illegal arms trade. Otieno is working with Russians (in a telephone conversation, Thabiti mentioned Nahid’s name) and selling arms to Sudan. They were planning to join the meeting in Dubai.

Figure 1.4 – Reports on email transactions were analyzed with a special interest in the terms in the tag cloud display.

North Korea – Russian gun dealers buy guns from North Korea. A plane carrying guns from North Korea to Kiev was intercepted in Thailand.

Yemen - Saleh Ahmed is an arms dealer in Yemen, and he is trying to purchase guns from the Russians through his contact with Dombrowski.

Hypothesis for future activities: After the meeting in Dubai, the different organizations in Pakistan, Gaza, Turkey, Syria and Yemen will probably make deals with the Russian gun dealers. We can expect more money transfers from these people to dealers in Russia. After the shipments are made, armed organizations will most likely organize terrorist activities in the near future in Pakistan, Gaza, Turkey and Syria. We can expect a joint activity between people in Syria and Turkey, as they had frequent communications on the phone. The organization named Lashkar-e-Jhangvi will most likely organize an event in Karachi. Kasem, Anka and Khouri will organize an armed event in Gaza.

MC1.2: Illustrate the associations among the players in the arms dealing through a social network. If there are linkages among countries, please highlight these as well in the social network. Our analysts are interested in seeing different views of the social network that might help them in counterintelligence activities (people, places, activities, communication patterns that are key to the network).

All the documents provided in this challenge consisted of smaller reports. Prior to our analyses, we split the documents into separate reports. To build relation graphs, we used the following procedure:

- The user selects a subset of the terms using ComVis and exports this subset

- Each pair of terms in this subset is checked for a relation. Two terms are considered to be related if they appear in the same report.

- The relations are converted into a relational graph, which is then visualized in Tulip.
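The procedure above can be sketched in a few lines of Python. The function name, the edge weights (number of reports shared by a pair) and the sample reports are assumptions added for illustration. Writing the result to a Tulip input file is left out because the export format is not described here.

```python
from itertools import combinations

def build_relation_graph(reports, selected_terms):
    """Link two selected terms whenever they occur in the same report.

    reports: list of term lists, one list per sub-report.
    Returns {(term_a, term_b): number of reports containing both}.
    """
    selected = set(selected_terms)
    edges = {}
    for report_terms in reports:
        present = sorted(selected & set(report_terms))
        for a, b in combinations(present, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Made-up sub-reports, already reduced to their terms.
reports = [
    ["bukhari", "karachi", "guns"],
    ["bukhari", "bhutani"],
    ["kasem", "gaza"],
]
edges = build_relation_graph(
    reports, ["bukhari", "bhutani", "karachi", "kasem", "gaza"]
)
```

Sorting the terms inside each report gives every undirected edge a single canonical key, so the same pair is never counted under two different orderings.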

The nodes in the graph are color coded with respect to their categories as seen in Figure 2.1.

Figure 2.1 – Categories and their respective node colors.

This automatic graph building mechanism helped us discover many relations by selecting different subsets. Using all terms resulted in a cluttered graph that is very hard to interpret. Therefore, we worked with smaller relational graphs constructed from subsets. The automatic approach has some drawbacks:

- Some people’s first names and surnames appear as separate nodes, e.g. the first name and surname of Boonmee Khemkhaengare are displayed as two nodes in the graph. The analyst should take this into consideration.

- Some common names like Ali or Mohammed can create artificial links in the graph, but these nodes and their edges can be removed manually by the analyst.

- Typos can result in duplicate nodes in the graph, e.g. we have two nodes, “Kasam” and “Kasem”, referring to the same person. These duplicates should be removed manually by the analyst. In Figure 2.3 this problem is annotated with selection 1.
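Duplicates caused by typos were removed by hand in our analysis. As a possible aid, string similarity could flag candidate pairs for the analyst to review. The Python sketch below uses the standard library's `difflib`. The function name and the 0.75 threshold are arbitrary choices for this example and were not part of our workflow.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(names, threshold=0.75):
    """Return pairs of names whose similarity ratio reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs

candidates = near_duplicates(["Kasam", "Kasem", "Khouri", "Anka"])
```

Such a check only proposes candidates. Short, genuinely different names can also score high, so the final merge decision should stay with the analyst.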

A subset of the terms consisting of only “People & Organization” and “Place” names is displayed in Figure 2.2. In this figure we can see the following relations:

- The structure of the organization in Karachi is visible in Sub-graph 1. Its members will have a meeting with the dealers in Dubai.

- Sub-graph 2 shows the active people in Gaza. They have relations with people in Moscow and are going to attend the meeting in Dubai.

- Sub-graphs 3 and 4 show the gun dealers in Moscow. They have relations with Yemen, Iran, Kiev (Ukraine), Nigeria, Gaza, Pakistan, Kenya and Thailand. These relations are visible by following the edges connecting place names.

- In Sub-graph 5, we can see that Nahid is related to Kenya and Dubai. Using this relation, we discovered that activities in Kenya are also related to activities in Moscow and that Kenyan parties are also attending the meeting in Dubai.

- In Sub-graph 6, we see the relation between people in Syria and Turkey. Baltasar and Hakan are connected via the place names Aleppo and Antalya.

- In Sub-graph 7, we see a separate connection between Spain and Venezuela. The details of this relation were discovered later using a more detailed graph visualization.

- The place common to all of the substructures in the graph is Dubai. This is due to the meeting of many parties with the Russian gun dealers.

Figure 2.2 – Graph to depict the relations between people and places. The selected sub-graphs indicate separate groups of relations

In Figure 2.3, we can see the above relation graph without the place names. This results in a clearer image of the overall relations. However, certain relations are not depicted because the spatial information is missing, e.g. the relation in sub-graph 6 in Figure 2.2 is not observable in Figure 2.3.

Figure 2.3 – Graph depicting only people and organization names. This graph is the result of brushing only “People and Organization” terms in ComVis. A graph production error due to typos in the dataset is indicated with label 1.

The relations can be analyzed in greater detail by adding nodes for general terms. In Figure 2.4, the graph consists of all terms mentioned in telephone intercept reports. We can clearly see the actors who communicate via telephone.

- People in Turkey and Syria communicate by phone, and “textbooks” is an important word in their conversations, most probably meaning guns.

- People in Gaza communicate by telephone, and it can be seen that they hold their meetings in cafes and apartments.

- From this graph, we discover one more relation, between Iran and Ukraine. There are phone calls between contacts in these places, and a closer reading of the reports behind this relation revealed that the contact in Iran will also attend the meeting in Dubai (there is a link between Tebriz and Dubai in Figure 2.4).

Figure 2.4 – Graph to depict the relations made over telephone.

By utilizing a subset involving e-mail and message board reports, we ended up with the structure in Figure 2.5. We discovered the following information:

- There is mail traffic between Mikhail Dombrowski in Moscow and people in Nigeria.

- The resemblance between a nickname, jtomski, used on message boards (named vwparts4salecheep and savvytraveladvice) and Dombrowski’s mail address (Joetomsk@au.ru) most likely means that they are the same person. We see that this person is using message boards to make deals.

- Dombrowski (using the nickname jtomski) arranges a meeting with people who use the nicknames vwhombre and john. These two names are also mentioned in the telephone calls between Venezuela and Spain (as hombre and john). We therefore conclude that the parties in Venezuela and Spain are also connected to the dealers in Moscow, and that they use message boards to communicate with the Russians and the telephone to communicate with each other.

- We can clearly see that the term “parts” is a key word used on message boards.

Figure 2.5 – Graph to depict mail and message board relations
